1 Introduction

Single-cell RNA sequencing (scRNA-seq) has revolutionized biological research by enabling the measurement of transcriptomic profiles at the single-cell level. Appropriately clustering cells on scRNA-seq data with multiple subjects could be a problem in larger scale studies. Subject-specific variation is the most challenging one. It could have a significant impact on clustering accuracy when systematic heterogeneity from multiple subjects exists. Though there are several methods seeking to address such effects most of them suffer from several limitations.

EDClust is designed for multi-subject scRNA-seq cell clustering. EDClust adopts a Dirichlet-multinomial mixture model and explicitly accounts for cell type heterogeneity, subject heterogeneity, and clustering uncertainty. EDClust pipeline mainly includes three steps: (A). input data. (B). initialize parameters. (C). clustering. Based on an EM and MM hybrid framework, EDClust offers functions for predicting cell type labels, estimating parameters of effects from different sources, and posterior probabilities for cells being in each cluster.

2 Installation and help

2.1 Install EDClust

To install this package, start R and enter:

library(devtools)
install_github("weix21/EDClust")
library(EDClust)

2.2 Install Julia

EDClust requires a working installation of Julia. This can be easily done via:

julia <- setup_julia() 

which will automatically install both Julia and the required Julia packages if they are missing.

If you want to setup Julia manually, you can download a generic binary from https://julialang.org/downloads/. Before using an existing Julia binary, make sure that Julia is found in the path. Then you can do

julia <- setup_julia(path = "the folder that contains Julia binary") 

For more information about Julia setup, please see the julia_setup() function from JuliaCall package, which provides an R interface to Julia.

If you plan to update EDClust to newest version, after updating, please also specify Update = TRUE to update the related EDClust’s Julia package version.

julia <- setup_julia(Update = TRUE) 

2.3 Help for EDClust

Any EDClust questions should be posted to the GitHub Issue section of EDClust https://github.com/weix21/EDClust/issues.

3 Using EDClust for scRNA-seq clustering analysis

EDClust requires a pre-processed count expression matrix with rows representing the genes and columns representing the cells, and an array containing subject ID information. Here’s we show an example in Mlung_sub dataset step by step.

3.1 Setup the package

EDClust utilizes EDClust.jl for its core routines and provides a direct wrapper over EDClust.jl. You need to do initial setup by function setup_julia(). You could also use an existing Julia binary by specifying the path.

library(EDClust)
julia <- setup_julia(path = "the folder that contains Julia binary") 

3.2 Load the data

The Mlung_sub dataset is loaded with three objects including the count expression matrix (count_all_notna), corresponding cell type information (annot_all_notna) and subject ID information (subject_all_notna). This dataset is pre-processed by FEAST, which filters out the genes based on the dropout rate and infers the gene-level significance. Here we obtain the top 500 features for clustering.

data("Mlung_sub")
dim(count_all_notna)
#> [1]  500 1756
table(subject_all_notna)
#> subject_all_notna
#>   1   2   3   4 
#> 395 532 448 381
table(annot_all_notna)
#> annot_all_notna
#>                    Club Cells             Endothelial Cells 
#>                           111                           428 
#>                   Lymphocytes                   Macrophages 
#>                           222                           421 
#>                   Neutrophils Small Airway Epithelial Cells 
#>                           412                           162

3.3 Feature selection

FEAST, as a newly developed feature selection tool, shows superior performance in substantial real data analyses. Therefore EDClust uses FEAST for feature selection and its main functions is embedded in function FEAST_select().

data <- FEAST_select(count, subject, Ncluster = 6, Nfeature = 500)

FEAST bascially provides two functions for feature selection: FEAST() and FEAST_fast(), and FEAST_select() inherits most of their arguments. For extreme large dataset (sample size >5000), FEAST_fast() will be automatically applied. In our real data analyses, we could obtain satisfactory results with only 500 features, and thus we use Nfeature = 500 by default.

3.4 Baseline selection and Initialize parameters

EM algorithm often suffers from locally optimal solutions while EDClust is based an EM and MM hybrid framework. Thus it is necessary to provide appropriate initial values for the parameter estimations, especially the cell type-specific effect \(\alpha_0\). Unsupervised clustering methods will be applied on the selected baseline subject to obtain initial clusters. We recommend using SHARP as the unsupervised clustering method because of its good computational performance. And then EDClust will compute the initial value for \(\alpha_0\) based on the clustering results.

Though commonly baseline subject selections won’t have great impact on clustering, when the selected baseline subject has low signal-to-noise and lead to bad SHAPR clustering, we fail to obtain well informative initial values for the model parameters and therefore EDClust is more likely to be stuck at local optimal solutions. We provide a method to select the best possible subject as the baseline, based on F-test on the SHARP clusters.

baseID <- Baseline_select(count_all_notna, subject_all_notna, Ncluster = 6)

Since subject 2 has the greatest score, we select subject 2 as the baseline subject for parameter initialization.

Using function InitVal_S() will automatically run SHARP and compute the initial value for \(\alpha_0\). You could specify the number of clusters, baseline subject and seed for SHARP.

alpha_0 <- InitVal_S(count_all_notna, subject_all_notna, Ncluster = 6, ID = 2, seed = 2345) 

It is noted that though SHARP is recommended, You could also use function InitVal() to compute the initial values based on clustering results obtained from other methods.

3.5 Clustering

After computing the initial values, we can use function FitPolya() for clustering. The results provides predicted cell label (mem); the log likelihood (loglik); estimations of cell type-specific effects (alpha_0), subject-specific effects (delta) as well as overall effect (alpha); and the probability that each cell belongs to each cluster (p).

result <- FitPolya(count_all_notna, subject_all_notna, alpha_0, BaseID = 2L)

Based on the clustering results, we can benchmark the performance of EDClust by ARI. And The t-SNE plot can also be shown as:

library(mclust)
adjustedRandIndex(result$mem, annot_all_notna)

4 Session Info

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] EDClust_0.2.0    knitr_1.37       BiocStyle_2.20.2
#> 
#> loaded via a namespace (and not attached):
#>   [1] Rtsne_0.15                  colorspace_2.0-2           
#>   [3] ellipsis_0.3.2              class_7.3-19               
#>   [5] clusterCrit_1.2.8           mclust_5.4.8               
#>   [7] XVector_0.32.0              GenomicRanges_1.44.0       
#>   [9] proxy_0.4-26                JuliaCall_0.17.4           
#>  [11] fansi_0.5.0                 mvtnorm_1.1-3              
#>  [13] codetools_0.2-18            splines_4.1.1              
#>  [15] doParallel_1.0.16           robustbase_0.93-9          
#>  [17] jsonlite_1.7.2              cluster_2.1.2              
#>  [19] clues_0.6.2.2               pheatmap_1.0.12            
#>  [21] shiny_1.7.1                 BiocManager_1.30.16        
#>  [23] rrcov_1.6-0                 compiler_4.1.1             
#>  [25] assertthat_0.2.1            Matrix_1.4-0               
#>  [27] fastmap_1.1.0               later_1.3.0                
#>  [29] htmltools_0.5.2             tools_4.1.1                
#>  [31] igraph_1.2.9                gtable_0.3.0               
#>  [33] glue_1.6.0                  GenomeInfoDbData_1.2.6     
#>  [35] dplyr_1.0.7                 doRNG_1.8.2                
#>  [37] Rcpp_1.0.7                  Biobase_2.52.0             
#>  [39] jquerylib_0.1.4             TSCAN_1.30.0               
#>  [41] vctrs_0.3.8                 nlme_3.1-153               
#>  [43] iterators_1.0.13            xfun_0.29                  
#>  [45] stringr_1.4.0               mime_0.12                  
#>  [47] lifecycle_1.0.1             irlba_2.3.5                
#>  [49] rngtools_1.5.2              TrajectoryUtils_1.0.0      
#>  [51] gtools_3.9.2                WriteXLS_6.3.0             
#>  [53] DEoptimR_1.0-9              zlibbioc_1.38.0            
#>  [55] scales_1.1.1                promises_1.2.0.1           
#>  [57] MatrixGenerics_1.4.3        parallel_4.1.1             
#>  [59] SummarizedExperiment_1.22.0 RColorBrewer_1.1-2         
#>  [61] SingleCellExperiment_1.14.1 yaml_2.2.1                 
#>  [63] gridExtra_2.3               ggplot2_3.3.5              
#>  [65] sass_0.4.0                  FEAST_1.0.0                
#>  [67] fastICA_1.2-3               stringi_1.7.6              
#>  [69] highr_0.9                   S4Vectors_0.30.2           
#>  [71] pcaPP_1.9-74                foreach_1.5.1              
#>  [73] e1071_1.7-9                 caTools_1.18.2             
#>  [75] BiocGenerics_0.38.0         BiocParallel_1.26.2        
#>  [77] GenomeInfoDb_1.28.4         rlang_0.4.12               
#>  [79] pkgconfig_2.0.3             matrixStats_0.61.0         
#>  [81] bitops_1.0-7                evaluate_0.14              
#>  [83] lattice_0.20-45             ROCR_1.0-11                
#>  [85] purrr_0.3.4                 tidyselect_1.1.1           
#>  [87] plyr_1.8.6                  magrittr_2.0.1             
#>  [89] bookdown_0.24               R6_2.5.1                   
#>  [91] IRanges_2.26.0              gplots_3.1.1               
#>  [93] generics_0.1.1              combinat_0.0-8             
#>  [95] DelayedArray_0.18.0         DBI_1.1.1                  
#>  [97] pillar_1.6.4                mgcv_1.8-38                
#>  [99] RCurl_1.98-1.5              tibble_3.1.6               
#> [101] crayon_1.4.2                SC3_1.20.0                 
#> [103] KernSmooth_2.23-20          utf8_1.2.2                 
#> [105] rmarkdown_2.11              viridis_0.6.2              
#> [107] grid_4.1.1                  data.table_1.14.2          
#> [109] flashClust_1.01-2           SHARP_1.1.0                
#> [111] digest_0.6.29               xtable_1.8-4               
#> [113] httpuv_1.6.3                stats4_4.1.1               
#> [115] munsell_0.5.0               viridisLite_0.4.0          
#> [117] bslib_0.3.1